This project explores the prediction of loan default status using a machine learning classification approach. The notebook includes data preprocessing, exploratory data analysis, feature engineering, and the application of classification algorithms such as Logistic Regression and Random Forest. The goal is to accurately classify whether a loan applicant is likely to default, leveraging real-world financial data and evaluating model performance through metrics like accuracy and confusion matrix. The project provides insights into important predictive features and demonstrates a practical workflow for tackling binary classification problems in the financial domain.
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=TRUE)
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages(c("dplyr", "tidyr", "stringr", "formattable", "ggcorrplot", "ggplot2", "lubridate", "gridExtra", "readr", "scales", "fastDummies", "plotly", "reshape2", "randomForest", "caret", "RANN", "e1071", "MLmetrics", "xgboost", "DMwR2", "smotefamily", "glmnet", "rpart", "caTools", "progress", "stats"))
library(dplyr)
library(tidyr)
library(stringr)
library(formattable)
library(ggcorrplot)
suppressWarnings({
library(ggplot2)
library(lubridate)
library(gridExtra)
})
library(readr)
library(scales)
library(fastDummies)
library(plotly)
library(reshape2)
library(randomForest)
library(caret)
library(RANN)
library(e1071)
library(MLmetrics)
library(xgboost)
library(DMwR2)
library(smotefamily)
library(glmnet)
library(rpart)
library(caTools)
library(progress)
library(stats)
train <- read_csv("train.csv")
test <- read_csv("test.csv")
df <- bind_rows(train, test)
cat("Shape of training dataframe: ", dim(train), "\n")
## Shape of training dataframe: 233154 41
cat("Shape of testing dataframe: ", dim(test), "\n")
## Shape of testing dataframe: 112392 40
train <- train[!duplicated(train), ]
test <- test[!duplicated(test), ]
cat("Shape of training dataframe after removing duplicates: ", dim(train), "\n")
## Shape of training dataframe after removing duplicates: 233154 41
cat("Shape of testing dataframe after removing duplicates: ", dim(test), "\n")
## Shape of testing dataframe after removing duplicates: 112392 40
cat("Names of columns: ", colnames(train), "\n")
## Names of columns: UNIQUEID DISBURSED_AMOUNT ASSET_COST LTV BRANCH_ID SUPPLIER_ID MANUFACTURER_ID CURRENT_PINCODE_ID DATE_OF_BIRTH EMPLOYMENT_TYPE DISBURSAL_DATE STATE_ID EMPLOYEE_CODE_ID MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG DRIVING_FLAG PASSPORT_FLAG PERFORM_CNS_SCORE PERFORM_CNS_SCORE_DESCRIPTION PRI_NO_OF_ACCTS PRI_ACTIVE_ACCTS PRI_OVERDUE_ACCTS PRI_CURRENT_BALANCE PRI_SANCTIONED_AMOUNT PRI_DISBURSED_AMOUNT SEC_NO_OF_ACCTS SEC_ACTIVE_ACCTS SEC_OVERDUE_ACCTS SEC_CURRENT_BALANCE SEC_SANCTIONED_AMOUNT SEC_DISBURSED_AMOUNT PRIMARY_INSTAL_AMT SEC_INSTAL_AMT NEW_ACCTS_IN_LAST_SIX_MONTHS DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS AVERAGE_ACCT_AGE CREDIT_HISTORY_LENGTH NO_OF_INQUIRIES LOAN_DEFAULT
We have 41 variables and no duplicates were detected. Overall, we have over 230 thousand observations in our training set. Let’s check whether the dataset has any missing values.
total <- nrow(train)
missing_data <- train %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column name", values_to = "Total missing") %>%
mutate(`Percent missing` = (`Total missing` / total) * 100) %>%
arrange(desc(`Total missing`))
print(missing_data)
## # A tibble: 41 × 3
## `column name` `Total missing` `Percent missing`
## <chr> <int> <dbl>
## 1 EMPLOYMENT_TYPE 7661 3.29
## 2 UNIQUEID 0 0
## 3 DISBURSED_AMOUNT 0 0
## 4 ASSET_COST 0 0
## 5 LTV 0 0
## 6 BRANCH_ID 0 0
## 7 SUPPLIER_ID 0 0
## 8 MANUFACTURER_ID 0 0
## 9 CURRENT_PINCODE_ID 0 0
## 10 DATE_OF_BIRTH 0 0
## # ℹ 31 more rows
We can see that more than 3% of the ‘EMPLOYMENT_TYPE’ variable’s values are missing. We will treat the NA values as not employed.
print(unique(train$EMPLOYMENT_TYPE))
## [1] "Salaried" "Self employed" NA
length(unique(train$EMPLOYMENT_TYPE))
## [1] 3
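One possible encoding (a sketch only, not applied at this point, which is why the NA category still appears in the tables below; the “Not employed” label is an assumption, not a value from the raw data) is to replace NA with an explicit level:

```r
# Replace missing employment type with an explicit "Not employed" level
# ("Not employed" is an assumed label, not present in the raw data)
fill_employment <- function(x) ifelse(is.na(x), "Not employed", x)
# train$EMPLOYMENT_TYPE <- fill_employment(train$EMPLOYMENT_TYPE)
# test$EMPLOYMENT_TYPE  <- fill_employment(test$EMPLOYMENT_TYPE)
```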
Let’s analyze the structure of the data set. We can see that most of the variables are numeric. We see that ‘AVERAGE_ACCT_AGE’ and ‘CREDIT_HISTORY_LENGTH’ can be converted from character strings (e.g. “1yrs 11mon”) to numbers.
str(train)
## tibble [233,154 × 41] (S3: tbl_df/tbl/data.frame)
## $ UNIQUEID : num [1:233154] 420825 537409 417566 624493 539055 ...
## $ DISBURSED_AMOUNT : num [1:233154] 50578 47145 53278 57513 52378 ...
## $ ASSET_COST : num [1:233154] 58400 65550 61360 66113 60300 ...
## $ LTV : num [1:233154] 89.5 73.2 89.6 88.5 88.4 ...
## $ BRANCH_ID : num [1:233154] 67 67 67 67 67 67 67 67 67 67 ...
## $ SUPPLIER_ID : num [1:233154] 22807 22807 22807 22807 22807 ...
## $ MANUFACTURER_ID : num [1:233154] 45 45 45 45 45 45 45 45 45 45 ...
## $ CURRENT_PINCODE_ID : num [1:233154] 1441 1502 1497 1501 1495 ...
## $ DATE_OF_BIRTH : chr [1:233154] "01-01-1984" "31-07-1985" "24-08-1985" "30-12-1993" ...
## $ EMPLOYMENT_TYPE : chr [1:233154] "Salaried" "Self employed" "Self employed" "Self employed" ...
## $ DISBURSAL_DATE : chr [1:233154] "03-08-2018" "26-09-2018" "01-08-2018" "26-10-2018" ...
## $ STATE_ID : num [1:233154] 6 6 6 6 6 6 6 6 6 6 ...
## $ EMPLOYEE_CODE_ID : num [1:233154] 1998 1998 1998 1998 1998 ...
## $ MOBILENO_AVL_FLAG : num [1:233154] 1 1 1 1 1 1 1 1 1 1 ...
## $ AADHAR_FLAG : num [1:233154] 1 1 1 1 1 1 1 1 1 0 ...
## $ PAN_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ VOTERID_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 1 ...
## $ DRIVING_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PASSPORT_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PERFORM_CNS_SCORE : num [1:233154] 0 598 0 305 0 825 0 17 718 818 ...
## $ PERFORM_CNS_SCORE_DESCRIPTION : chr [1:233154] "No Bureau History Available" "I-Medium Risk" "No Bureau History Available" "L-Very High Risk" ...
## $ PRI_NO_OF_ACCTS : num [1:233154] 0 1 0 3 0 2 0 1 1 1 ...
## $ PRI_ACTIVE_ACCTS : num [1:233154] 0 1 0 0 0 0 0 1 1 0 ...
## $ PRI_OVERDUE_ACCTS : num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
## $ PRI_CURRENT_BALANCE : num [1:233154] 0 27600 0 0 0 ...
## $ PRI_SANCTIONED_AMOUNT : num [1:233154] 0 50200 0 0 0 ...
## $ PRI_DISBURSED_AMOUNT : num [1:233154] 0 50200 0 0 0 ...
## $ SEC_NO_OF_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_ACTIVE_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_OVERDUE_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_CURRENT_BALANCE : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_SANCTIONED_AMOUNT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_DISBURSED_AMOUNT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRIMARY_INSTAL_AMT : num [1:233154] 0 1991 0 31 0 ...
## $ SEC_INSTAL_AMT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ NEW_ACCTS_IN_LAST_SIX_MONTHS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS: num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
## $ AVERAGE_ACCT_AGE : chr [1:233154] "0yrs 0mon" "1yrs 11mon" "0yrs 0mon" "0yrs 8mon" ...
## $ CREDIT_HISTORY_LENGTH : chr [1:233154] "0yrs 0mon" "1yrs 11mon" "0yrs 0mon" "1yrs 3mon" ...
## $ NO_OF_INQUIRIES : num [1:233154] 0 0 0 1 1 0 0 0 1 0 ...
## $ LOAN_DEFAULT : num [1:233154] 0 1 0 1 1 0 0 0 0 0 ...
parse_age <- function(x) {
years <- as.numeric(str_extract(x, "\\d+(?=yrs)"))
months <- as.numeric(str_extract(x, "\\d+(?=mon)"))
years[is.na(years)] <- 0
months[is.na(months)] <- 0
years + months / 12
}
train <- train %>%
mutate(AVERAGE_ACCT_AGE = parse_age(AVERAGE_ACCT_AGE),
CREDIT_HISTORY_LENGTH = parse_age(CREDIT_HISTORY_LENGTH))
test <- test %>%
mutate(AVERAGE_ACCT_AGE = parse_age(AVERAGE_ACCT_AGE),
CREDIT_HISTORY_LENGTH = parse_age(CREDIT_HISTORY_LENGTH))
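As a quick sanity check on `parse_age` (inputs chosen arbitrarily):

```r
# Convert "Xyrs Ymon" strings to fractional years
parse_age(c("0yrs 0mon", "1yrs 11mon", "10yrs 3mon"))
# 1 + 11/12 ≈ 1.917; 10 + 3/12 = 10.25
```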
Additionally, we converted the variables ‘DATE_OF_BIRTH’ and ‘DISBURSAL_DATE’ from character to Date.
train$DATE_OF_BIRTH <- as.Date(train$DATE_OF_BIRTH, format = "%d-%m-%Y")
test$DATE_OF_BIRTH <- as.Date(test$DATE_OF_BIRTH, format = "%d-%m-%Y")
train$DISBURSAL_DATE <- as.Date(train$DISBURSAL_DATE, format = "%d-%m-%Y")
test$DISBURSAL_DATE <- as.Date(test$DISBURSAL_DATE, format = "%d-%m-%Y")
Finally, we obtained the dataset shaped like this:
str(train)
## tibble [233,154 × 41] (S3: tbl_df/tbl/data.frame)
## $ UNIQUEID : num [1:233154] 420825 537409 417566 624493 539055 ...
## $ DISBURSED_AMOUNT : num [1:233154] 50578 47145 53278 57513 52378 ...
## $ ASSET_COST : num [1:233154] 58400 65550 61360 66113 60300 ...
## $ LTV : num [1:233154] 89.5 73.2 89.6 88.5 88.4 ...
## $ BRANCH_ID : num [1:233154] 67 67 67 67 67 67 67 67 67 67 ...
## $ SUPPLIER_ID : num [1:233154] 22807 22807 22807 22807 22807 ...
## $ MANUFACTURER_ID : num [1:233154] 45 45 45 45 45 45 45 45 45 45 ...
## $ CURRENT_PINCODE_ID : num [1:233154] 1441 1502 1497 1501 1495 ...
## $ DATE_OF_BIRTH : Date[1:233154], format: "1984-01-01" "1985-07-31" ...
## $ EMPLOYMENT_TYPE : chr [1:233154] "Salaried" "Self employed" "Self employed" "Self employed" ...
## $ DISBURSAL_DATE : Date[1:233154], format: "2018-08-03" "2018-09-26" ...
## $ STATE_ID : num [1:233154] 6 6 6 6 6 6 6 6 6 6 ...
## $ EMPLOYEE_CODE_ID : num [1:233154] 1998 1998 1998 1998 1998 ...
## $ MOBILENO_AVL_FLAG : num [1:233154] 1 1 1 1 1 1 1 1 1 1 ...
## $ AADHAR_FLAG : num [1:233154] 1 1 1 1 1 1 1 1 1 0 ...
## $ PAN_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ VOTERID_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 1 ...
## $ DRIVING_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PASSPORT_FLAG : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PERFORM_CNS_SCORE : num [1:233154] 0 598 0 305 0 825 0 17 718 818 ...
## $ PERFORM_CNS_SCORE_DESCRIPTION : chr [1:233154] "No Bureau History Available" "I-Medium Risk" "No Bureau History Available" "L-Very High Risk" ...
## $ PRI_NO_OF_ACCTS : num [1:233154] 0 1 0 3 0 2 0 1 1 1 ...
## $ PRI_ACTIVE_ACCTS : num [1:233154] 0 1 0 0 0 0 0 1 1 0 ...
## $ PRI_OVERDUE_ACCTS : num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
## $ PRI_CURRENT_BALANCE : num [1:233154] 0 27600 0 0 0 ...
## $ PRI_SANCTIONED_AMOUNT : num [1:233154] 0 50200 0 0 0 ...
## $ PRI_DISBURSED_AMOUNT : num [1:233154] 0 50200 0 0 0 ...
## $ SEC_NO_OF_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_ACTIVE_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_OVERDUE_ACCTS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_CURRENT_BALANCE : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_SANCTIONED_AMOUNT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEC_DISBURSED_AMOUNT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRIMARY_INSTAL_AMT : num [1:233154] 0 1991 0 31 0 ...
## $ SEC_INSTAL_AMT : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ NEW_ACCTS_IN_LAST_SIX_MONTHS : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
## $ DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS: num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
## $ AVERAGE_ACCT_AGE : num [1:233154] 0 1.917 0 0.667 0 ...
## $ CREDIT_HISTORY_LENGTH : num [1:233154] 0 1.92 0 1.25 0 ...
## $ NO_OF_INQUIRIES : num [1:233154] 0 0 0 1 1 0 0 0 1 0 ...
## $ LOAN_DEFAULT : num [1:233154] 0 1 0 1 1 0 0 0 0 0 ...
Now we can proceed with the EDA. Let’s start by analyzing the distribution of the target variable ‘LOAN_DEFAULT’.
class_df <- train %>%
group_by(LOAN_DEFAULT) %>%
summarise(UNIQUEID_count = n()) %>%
arrange(desc(UNIQUEID_count))
formattable(class_df, list( UNIQUEID_count = color_bar("lightgreen") ))
| LOAN_DEFAULT | UNIQUEID_count |
|---|---|
| 0 | 182543 |
| 1 | 50611 |
colors <- c("0" = "deepskyblue", "1" = "deeppink")
ggplot(train, aes(x = factor(LOAN_DEFAULT), fill = factor(LOAN_DEFAULT))) +
geom_bar() +
scale_fill_manual(values = colors) +
labs(title = "Class Distribution", x = "Loan Default", y = "Count") +
theme_minimal()
count_default_0 <- sum(train$LOAN_DEFAULT == 0, na.rm = TRUE)
count_default_1 <- sum(train$LOAN_DEFAULT == 1, na.rm = TRUE)
total <- count_default_0 + count_default_1
percentage_0 <- (count_default_0 / total) * 100
percentage_1 <- (count_default_1 / total) * 100
cat(sprintf("%% of no defaults : %.2f%%\n", percentage_0))
## % of no defaults : 78.29%
cat(sprintf("Number of no defaults : %d\n", count_default_0))
## Number of no defaults : 182543
cat(sprintf("%% of defaults : %.2f%%\n", percentage_1))
## % of defaults : 21.71%
cat(sprintf("Number of defaults : %d\n", count_default_1))
## Number of defaults : 50611
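The same split can be obtained in one line with base R’s `prop.table`:

```r
# Class shares in percent; matches the counts printed above
round(prop.table(table(train$LOAN_DEFAULT)) * 100, 2)
```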
Also, let’s analyze the distribution of the target across other categorical variables: ‘EMPLOYMENT_TYPE’, ‘MOBILENO_AVL_FLAG’, ‘AADHAR_FLAG’, ‘PAN_FLAG’, ‘VOTERID_FLAG’, ‘DRIVING_FLAG’ and ‘PASSPORT_FLAG’.
‘EMPLOYMENT_TYPE’
train %>%
group_by(EMPLOYMENT_TYPE, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(EMPLOYMENT_TYPE) %>%
mutate(prop = count / sum(count)) %>%
arrange(EMPLOYMENT_TYPE, LOAN_DEFAULT) %>%
print(n = Inf)
## # A tibble: 6 × 4
## # Groups: EMPLOYMENT_TYPE [3]
## EMPLOYMENT_TYPE LOAN_DEFAULT count prop
## <chr> <dbl> <int> <dbl>
## 1 Salaried 0 77948 0.797
## 2 Salaried 1 19910 0.203
## 3 Self employed 0 98578 0.772
## 4 Self employed 1 29057 0.228
## 5 <NA> 0 6017 0.785
## 6 <NA> 1 1644 0.215
‘MOBILENO_AVL_FLAG’
train %>%
group_by(MOBILENO_AVL_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(MOBILENO_AVL_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 2 × 4
## # Groups: MOBILENO_AVL_FLAG [1]
## MOBILENO_AVL_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 1 0 182543 0.783
## 2 1 1 50611 0.217
‘AADHAR_FLAG’
train %>%
group_by(AADHAR_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(AADHAR_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 4 × 4
## # Groups: AADHAR_FLAG [2]
## AADHAR_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 0 0 27684 0.744
## 2 0 1 9546 0.256
## 3 1 0 154859 0.790
## 4 1 1 41065 0.210
‘PAN_FLAG’
train %>%
group_by(PAN_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(PAN_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 4 × 4
## # Groups: PAN_FLAG [2]
## PAN_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 0 0 168799 0.783
## 2 0 1 46734 0.217
## 3 1 0 13744 0.780
## 4 1 1 3877 0.220
‘VOTERID_FLAG’
train %>%
group_by(VOTERID_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(VOTERID_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 4 × 4
## # Groups: VOTERID_FLAG [2]
## VOTERID_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 0 0 157565 0.790
## 2 0 1 41795 0.210
## 3 1 0 24978 0.739
## 4 1 1 8816 0.261
‘DRIVING_FLAG’
train %>%
group_by(DRIVING_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(DRIVING_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 4 × 4
## # Groups: DRIVING_FLAG [2]
## DRIVING_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 0 0 178216 0.783
## 2 0 1 49519 0.217
## 3 1 0 4327 0.798
## 4 1 1 1092 0.202
‘PASSPORT_FLAG’
train %>%
group_by(PASSPORT_FLAG, LOAN_DEFAULT) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(PASSPORT_FLAG) %>%
mutate(percentage = count / sum(count)) %>%
print()
## # A tibble: 4 × 4
## # Groups: PASSPORT_FLAG [2]
## PASSPORT_FLAG LOAN_DEFAULT count percentage
## <dbl> <dbl> <int> <dbl>
## 1 0 0 182121 0.783
## 2 0 1 50537 0.217
## 3 1 0 422 0.851
## 4 1 1 74 0.149
Combined
train %>%
group_by(LOAN_DEFAULT, EMPLOYMENT_TYPE, AADHAR_FLAG, PAN_FLAG, DRIVING_FLAG, PASSPORT_FLAG, VOTERID_FLAG) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count)) %>%
print()
## # A tibble: 103 × 8
## LOAN_DEFAULT EMPLOYMENT_TYPE AADHAR_FLAG PAN_FLAG DRIVING_FLAG PASSPORT_FLAG
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 Self employed 1 0 0 0
## 2 0 Salaried 1 0 0 0
## 3 1 Self employed 1 0 0 0
## 4 1 Salaried 1 0 0 0
## 5 0 Self employed 0 0 0 0
## 6 0 Salaried 0 0 0 0
## 7 0 <NA> 1 0 0 0
## 8 1 Self employed 0 0 0 0
## 9 0 Salaried 1 1 0 0
## 10 0 Self employed 1 1 0 0
## # ℹ 93 more rows
## # ℹ 2 more variables: VOTERID_FLAG <dbl>, count <int>
We can observe that the split between defaulted and non-defaulted clients across all analyzed variables fluctuates around 21/79.
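To check formally whether a flag is associated with default, a chi-squared test of independence is a quick option (a sketch; with a sample this large, even tiny differences will be statistically significant, so effect sizes still matter):

```r
# Test association between a document flag and default status
chisq.test(table(train$AADHAR_FLAG, train$LOAN_DEFAULT))
chisq.test(table(train$VOTERID_FLAG, train$LOAN_DEFAULT))
```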
train$LOAN_DEFAULT <- as.factor(train$LOAN_DEFAULT)
ggplot(train, aes(x = DISBURSAL_DATE, fill = LOAN_DEFAULT)) +
geom_histogram(data = subset(train, LOAN_DEFAULT == 1),
bins = 50, alpha = 0.8, fill = "deeppink") +
geom_histogram(data = subset(train, LOAN_DEFAULT == 0),
bins = 50, alpha = 0.8, fill = "deepskyblue") +
facet_wrap(~LOAN_DEFAULT, ncol = 1, scales = "free_y",
labeller = as_labeller(c(`0` = "No default", `1` = "Default"))) +
labs(x = "DISBURSAL DATE", y = "Number of Loans") +
theme_bw() +
theme(legend.position = "none")
First let’s define a couple of useful functions that will help us quickly perform univariate analysis.
# Plot distribution of one feature
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
# Plot distribution bar plot of several features
plot_bar_comp <- function(var, nrow = 2) {
plot_list <- list()
for (feature in var) {
p <- ggplot(train, aes_string(x = feature)) +
geom_bar(fill = "skyblue") +
labs(x = feature, y = "Count plot") +
theme_minimal(base_size = 12)
plot_list[[length(plot_list) + 1]] <- p
}
ncol <- 2
do.call(grid.arrange, c(plot_list, nrow = nrow, ncol = ncol))
}
# Plot distribution plot of several features
plot_distribution_comp <- function(data, variables, n_rows = 3, n_cols = 2) {
plot_list <- list()
for (var in variables) {
# Check if variable has very large values that need scientific notation
scientific <- max(data[[var]], na.rm = TRUE) > 1e6
p <- ggplot(data, aes_string(x = var, color = "factor(LOAN_DEFAULT)",
fill = "factor(LOAN_DEFAULT)")) +
geom_density(alpha = 0.2) +
scale_color_manual(name = "",
values = c("0" = "blue", "1" = "red"),
labels = c("LOAN_DEFAULT = 0", "LOAN_DEFAULT = 1")) +
scale_fill_manual(name = "",
values = c("0" = "blue", "1" = "red"),
labels = c("LOAN_DEFAULT = 0", "LOAN_DEFAULT = 1")) +
labs(x = var, y = "Density plot") +
theme_bw() +
theme(panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_line(color = "grey90"),
legend.position = "bottom",
legend.box = "horizontal",
plot.title = element_text(size = 11))
# Use scientific notation for large value variables
if (scientific) {
p <- p + scale_x_continuous(labels = scientific_format())
}
plot_list[[length(plot_list) + 1]] <- p
}
# Arrange all plots in a grid
grid.arrange(grobs = plot_list, nrow = n_rows, ncol = n_cols)
}
# Box Plot for one feature
plot_box <- function(feature, color = "skyblue") {
  ggplot(train, aes_string(y = feature)) +
    geom_boxplot(fill = color, outlier.color = "red", na.rm = TRUE) +
    labs(title = paste("Box Plot of", feature), y = feature) +
    theme_minimal()
}
# Bar Plot for one feature
plot_bar <- function(feature) {
  ggplot(train, aes_string(y = feature, fill = "factor(LOAN_DEFAULT)")) +
    geom_bar(position = "dodge", color = "black") +
    scale_fill_manual(values = c("0" = "skyblue", "1" = "pink"),
                      name = "LOAN_DEFAULT", labels = c("No Default", "Default")) +
    labs(title = paste("Bar Plot of", feature, "by Loan Default"), y = feature, x = "Count") +
    theme_minimal() +
    theme(axis.text.y = element_text(size = 10),
          plot.title = element_text(size = 14, face = "bold"))
}
Now, let’s analyze an important explanatory variable: ‘DISBURSED_AMOUNT’.
summary(train$DISBURSED_AMOUNT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13320 47145 53803 54357 60413 990572
plot_distribution("DISBURSED_AMOUNT", "green")
plot_box("DISBURSED_AMOUNT", "green")
Based on the box plot, we can observe the presence of outliers. We will try two methods of dealing with them: replacing the outliers’ values with the sample mean, and binning. But first, let’s define functions that will speed up the outlier detection process. We will create a function that calculates all the necessary statistics: the mean and the lower and upper thresholds.
outlier_data <- function(df, feature) {
# Number of observations
obs <- length(df[[feature]])
cat("No. of observations in column:", obs, "\n")
# Descriptive statistics
data_mean <- mean(df[[feature]], na.rm = TRUE)
data_sd <- sd(df[[feature]], na.rm = TRUE)
cat(sprintf("Statistics: Mean = %.3f, Std dev = %.3f\n", data_mean, data_sd))
# Thresholds for outliers, set as 3 standard deviations
cut_off <- data_sd * 3
lower <- data_mean - cut_off
upper <- data_mean + cut_off
# Outliers count
outliers <- df[[feature]][df[[feature]] < lower | df[[feature]] > upper]
cat("Identified outliers:", length(outliers), "\n")
return(list(lower = lower, upper = upper, mean = data_mean))
}
Now, let’s smooth the outliers. We will create a function that imputes the outlying observations.
impute_outlier <- function(vec, lower, upper, mean_val) {
sapply(vec, function(x) {
if (is.na(x)) {
return(NA)
} else if (x < lower || x > upper) {
return(mean_val)
} else {
return(x)
}
})
}
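A toy check of `impute_outlier` on made-up numbers (the bounds and mean here are arbitrary):

```r
# Values outside [0, 10] are replaced with the supplied mean (5); NA is preserved
impute_outlier(c(-3, 2, NA, 50), lower = 0, upper = 10, mean_val = 5)
# [1]  5  2 NA  5
```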
disbursed_amount_stats <- outlier_data(train, 'DISBURSED_AMOUNT')
## No. of observations in column: 233154
## Statistics: Mean = 54356.994, Std dev = 12971.314
## Identified outliers: 3076
We managed to identify more than 3,000 outliers. Now we will replace their values with the mean.
train$DISBURSED_AMOUNT_new <- impute_outlier(train$DISBURSED_AMOUNT, disbursed_amount_stats$lower, disbursed_amount_stats$upper, disbursed_amount_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$DISBURSED_AMOUNT_new), "\n")
## No. of observations in column: 233154
Now let’s try binning. First, we have to divide the variable’s range into bins. We chose to divide it into 4 bins based on quartiles.
bin_labels <- c("Low", "Medium", "High", "Extreme")
quantiles <- quantile(train$DISBURSED_AMOUNT, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
# Quantile distribution
train$DISBURSED_AMOUNT_bins <- cut(train$DISBURSED_AMOUNT,
breaks = quantiles,
include.lowest = TRUE,
labels = bin_labels)
table(train$DISBURSED_AMOUNT_bins)
##
## Low Medium High Extreme
## 58537 58676 57734 58207
plot_bar("DISBURSED_AMOUNT_bins")
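Beyond raw counts, the default rate within each bin is often more informative; a sketch (the comparison `LOAN_DEFAULT == 1` works whether the target is numeric or a factor):

```r
# Share of defaults per disbursed-amount bin
train %>%
  group_by(DISBURSED_AMOUNT_bins) %>%
  summarise(count = n(),
            default_rate = mean(LOAN_DEFAULT == 1))
```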
Now let’s analyze another variable: ‘ASSET_COST’
summary(train$ASSET_COST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37000 65717 70946 75865 79202 1628992
plot_distribution("ASSET_COST", "tomato")
plot_box("ASSET_COST", "tomato")
Again, we can notice the presence of outliers, and we can apply a similar approach as before.
Imputation
asset_cost_stats <- outlier_data(train, 'ASSET_COST')
## No. of observations in column: 233154
## Statistics: Mean = 75865.068, Std dev = 18944.781
## Identified outliers: 4425
Almost 4,500 outliers were detected; let’s try smoothing them out with the mean.
train$ASSET_COST_new <- impute_outlier(train$ASSET_COST, asset_cost_stats$lower, asset_cost_stats$upper, asset_cost_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$ASSET_COST_new), "\n")
## No. of observations in column: 233154
Binning
quantiles <- quantile(train$ASSET_COST, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
#Quantile distribution
train$ASSET_COST_bins <- cut(train$ASSET_COST,
breaks = quantiles,
include.lowest = TRUE,
labels = bin_labels)
table(train$ASSET_COST_bins)
##
## Low Medium High Extreme
## 58290 58288 58287 58289
plot_bar("ASSET_COST_bins")
‘LTV’
summary(train$LTV)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.03 68.88 76.80 74.75 83.67 95.00
plot_distribution("LTV", "tomato")
plot_box("LTV", "tomato")
Imputation
LTV_stats <- outlier_data(train, 'LTV')
## No. of observations in column: 233154
## Statistics: Mean = 74.747, Std dev = 11.457
## Identified outliers: 2745
train$LTV_new <- impute_outlier(train$LTV, LTV_stats$lower, LTV_stats$upper, LTV_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$LTV_new), "\n")
## No. of observations in column: 233154
Binning
quantiles <- quantile(train$LTV, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
#Quantile distribution
train$LTV_bins <- cut(train$LTV,
breaks = quantiles,
include.lowest = TRUE,
labels = bin_labels)
table(train$LTV_bins)
##
## Low Medium High Extreme
## 58303 58299 58285 58267
plot_bar("LTV_bins")
‘PERFORM_CNS_SCORE’
summary(train$PERFORM_CNS_SCORE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 289.5 678.0 890.0
plot_distribution("PERFORM_CNS_SCORE", "tomato")
plot_box("PERFORM_CNS_SCORE", "tomato")
Imputation
perform_cns_score_stats <- outlier_data(train, 'PERFORM_CNS_SCORE')
## No. of observations in column: 233154
## Statistics: Mean = 289.463, Std dev = 338.375
## Identified outliers: 0
train$PERFORM_CNS_SCORE_new <- impute_outlier(train$PERFORM_CNS_SCORE, perform_cns_score_stats$lower, perform_cns_score_stats$upper, perform_cns_score_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$PERFORM_CNS_SCORE_new), "\n")
## No. of observations in column: 233154
Binning
bin_labels <- c("No History", "Very Low", "Low", "Medium", "High")
# Custom cut points based on score ranges (quantiles are not used here)
cut_bins <- c(-1, 10, 150, 350, 650, 1000)
train$PERFORM_CNS_SCORE_bins <- cut(train$PERFORM_CNS_SCORE,
breaks = cut_bins,
include.lowest = TRUE,
labels = bin_labels)
table(train$PERFORM_CNS_SCORE_bins)
##
## No History Very Low Low Medium High
## 116950 12835 9910 28425 65034
plot_bar("PERFORM_CNS_SCORE_bins")
‘PERFORM_CNS_SCORE_DESCRIPTION’
train %>%
group_by(PERFORM_CNS_SCORE_DESCRIPTION, PERFORM_CNS_SCORE_bins) %>%
summarise(count = n(), .groups = "drop") %>%
arrange(desc(count))
## # A tibble: 20 × 3
## PERFORM_CNS_SCORE_DESCRIPTION PERFORM_CNS_SCORE_bins count
## <chr> <fct> <int>
## 1 No Bureau History Available No History 116950
## 2 C-Very Low Risk High 16045
## 3 A-Very Low Risk High 14124
## 4 D-Very Low Risk High 11358
## 5 B-Very Low Risk High 9201
## 6 M-Very High Risk Low 8776
## 7 F-Low Risk High 8485
## 8 K-High Risk Medium 8277
## 9 H-Medium Risk Medium 6855
## 10 E-Low Risk High 5821
## 11 I-Medium Risk Medium 5557
## 12 G-Low Risk Medium 3988
## 13 Not Scored: Sufficient History Not Available Very Low 3765
## 14 J-High Risk Medium 3748
## 15 Not Scored: Not Enough Info available on the c… Very Low 3672
## 16 Not Scored: No Activity seen on the customer (… Very Low 2885
## 17 Not Scored: No Updates available in last 36 mo… Very Low 1534
## 18 L-Very High Risk Low 1134
## 19 Not Scored: Only a Guarantor Very Low 976
## 20 Not Scored: More than 50 active Accounts found Very Low 3
table(train$PERFORM_CNS_SCORE_DESCRIPTION)
##
## A-Very Low Risk
## 14124
## B-Very Low Risk
## 9201
## C-Very Low Risk
## 16045
## D-Very Low Risk
## 11358
## E-Low Risk
## 5821
## F-Low Risk
## 8485
## G-Low Risk
## 3988
## H-Medium Risk
## 6855
## I-Medium Risk
## 5557
## J-High Risk
## 3748
## K-High Risk
## 8277
## L-Very High Risk
## 1134
## M-Very High Risk
## 8776
## No Bureau History Available
## 116950
## Not Scored: More than 50 active Accounts found
## 3
## Not Scored: No Activity seen on the customer (Inactive)
## 2885
## Not Scored: No Updates available in last 36 months
## 1534
## Not Scored: Not Enough Info available on the customer
## 3672
## Not Scored: Only a Guarantor
## 976
## Not Scored: Sufficient History Not Available
## 3765
gg <- train %>%
group_by(PERFORM_CNS_SCORE_DESCRIPTION, LOAN_DEFAULT) %>%
summarise(counts = n(), .groups = "drop") %>%
group_by(PERFORM_CNS_SCORE_DESCRIPTION) %>%
mutate(percentage = counts / sum(counts) * 100) %>%
ungroup()
print(gg)
## # A tibble: 39 × 4
## PERFORM_CNS_SCORE_DESCRIPTION LOAN_DEFAULT counts percentage
## <chr> <fct> <int> <dbl>
## 1 A-Very Low Risk 0 11783 83.4
## 2 A-Very Low Risk 1 2341 16.6
## 3 B-Very Low Risk 0 7993 86.9
## 4 B-Very Low Risk 1 1208 13.1
## 5 C-Very Low Risk 0 13275 82.7
## 6 C-Very Low Risk 1 2770 17.3
## 7 D-Very Low Risk 0 9659 85.0
## 8 D-Very Low Risk 1 1699 15.0
## 9 E-Low Risk 0 4821 82.8
## 10 E-Low Risk 1 1000 17.2
## # ℹ 29 more rows
‘PRI_NO_OF_ACCTS’
summary(train$PRI_NO_OF_ACCTS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.441 3.000 453.000
plot_distribution("PRI_NO_OF_ACCTS", "tomato")
plot_box("PRI_NO_OF_ACCTS", "tomato")
Imputation
pri_no_of_accts_stats <- outlier_data(train, 'PRI_NO_OF_ACCTS')
## No. of observations in column: 233154
## Statistics: Mean = 2.441, Std dev = 5.217
## Identified outliers: 4119
train$PRI_NO_OF_ACCTS_new <- impute_outlier(train$PRI_NO_OF_ACCTS, pri_no_of_accts_stats$lower, pri_no_of_accts_stats$upper, pri_no_of_accts_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$PRI_NO_OF_ACCTS_new), "\n")
## No. of observations in column: 233154
Binning
bin_labels <- c("One", "More than One")
cut_bins <- c(-1, 1, 1000)
train$PRI_NO_OF_ACCTS_bins <- cut(train$PRI_NO_OF_ACCTS,
breaks = cut_bins,
include.lowest = TRUE,
labels = bin_labels)
table(train$PRI_NO_OF_ACCTS_bins)
##
## One More than One
## 151928 81226
plot_bar("PRI_NO_OF_ACCTS_bins")
‘PRI_OVERDUE_ACCTS’
summary(train$PRI_OVERDUE_ACCTS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1565 0.0000 25.0000
plot_distribution("PRI_OVERDUE_ACCTS", "tomato")
plot_box("PRI_OVERDUE_ACCTS", "tomato")
Imputation
pri_overdue_accts_stats <- outlier_data(train, 'PRI_OVERDUE_ACCTS')
## No. of observations in column: 233154
## Statistics: Mean = 0.157, Std dev = 0.549
## Identified outliers: 6305
train$PRI_OVERDUE_ACCTS_new <- impute_outlier(train$PRI_OVERDUE_ACCTS, pri_overdue_accts_stats$lower, pri_overdue_accts_stats$upper, pri_overdue_accts_stats$mean)
# No. of observations after the imputation
cat("No. of observations in column: ", length(train$PRI_OVERDUE_ACCTS_new), "\n")
## No. of observations in column: 233154
Binning
bin_labels <- c("None", "One (or more)")
cut_bins <- c(-1, 0, 1000)
train$PRI_OVERDUE_ACCTS_bins <- cut(train$PRI_OVERDUE_ACCTS,
breaks = cut_bins,
include.lowest = TRUE,
labels = bin_labels)
table(train$PRI_OVERDUE_ACCTS_bins)
plot_bar("PRI_OVERDUE_ACCTS_bins")
Let’s look into the variables of lesser importance.
var <- c("MOBILENO_AVL_FLAG", "AADHAR_FLAG", "PAN_FLAG", "VOTERID_FLAG", "PASSPORT_FLAG", "DRIVING_FLAG")
plot_bar_comp(var, nrow = 3)
‘EMPLOYMENT_TYPE’
ggplot(train, aes(x = EMPLOYMENT_TYPE, fill = factor(LOAN_DEFAULT))) +
geom_bar(position = "dodge", color = "black") +
labs(title = "EMPLOYMENT_TYPE vs LOAN_DEFAULT", x = "Employment Type", y = "Count",
fill = "Loan Default") +
theme_minimal() +
scale_fill_manual(values = c("0" = "skyblue", "1" = "tomato"))
Age is computed in days.
now <- Sys.Date()
train$DATE_OF_BIRTH <- as.Date(train$DATE_OF_BIRTH, format = "%d-%m-%Y")
# Calculate age in days
train$age <- as.numeric(difftime(now, train$DATE_OF_BIRTH, units = "days"))
# Print first few ages
print(head(train$age))
## [1] 15128 14551 14527 11477 17342 12686
train$disbursal_time <- as.numeric(difftime(now, train$DISBURSAL_DATE, units = "days"))
head(train$disbursal_time)
## [1] 2495 2441 2497 2411 2441 2448
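Since both age and disbursal_time are stored in days, a small hypothetical helper (not used elsewhere in the notebook) can convert them to years when interpreting the plots:

```r
# Hypothetical helper: convert day counts to approximate years for readability
days_to_years <- function(days) days / 365.25
days_to_years(c(15128, 2495))  # roughly 41.4 years of age, 6.8 years since disbursal
```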
‘MANUFACTURER_ID’
ggplot(train, aes(x = factor(MANUFACTURER_ID), fill = factor(LOAN_DEFAULT))) +
geom_bar(position = "dodge", color = "black") +
labs(title = "MANUFACTURER_ID vs LOAN_DEFAULT",
x = "MANUFACTURER_ID",
y = "Count",
fill = "Loan Default") +
scale_fill_manual(values = c("0" = "skyblue", "1" = "tomato"),
labels = c("No Default", "Default")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
‘BRANCH_ID’
ggplot(train, aes(x = factor(BRANCH_ID), fill = factor(LOAN_DEFAULT))) +
geom_bar(position = "dodge", color = "black") +
labs(title = "BRANCH_ID vs LOAN_DEFAULT",
x = "BRANCH_ID",
y = "Count",
fill = "Loan Default") +
scale_fill_manual(values = c("0" = "skyblue", "1" = "tomato"),
labels = c("No Default", "Default")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
var <- c("PRI_NO_OF_ACCTS_new", "PRI_ACTIVE_ACCTS", "PRI_OVERDUE_ACCTS_new",
"PRI_CURRENT_BALANCE", "PRI_SANCTIONED_AMOUNT", "PRI_DISBURSED_AMOUNT")
plot_distribution_comp(train, var)
Let’s see the new columns along with the less important continuous variables.
var <- c("SEC_NO_OF_ACCTS", "SEC_ACTIVE_ACCTS", "SEC_OVERDUE_ACCTS",
"SEC_CURRENT_BALANCE", "SEC_SANCTIONED_AMOUNT", "SEC_DISBURSED_AMOUNT")
plot_distribution_comp(train, var)
Now we will proceed with the feature selection. First, we will drop the variables that won’t be used in our model.
train <- train %>%
select(-DATE_OF_BIRTH, -STATE_ID, -EMPLOYEE_CODE_ID,
-SUPPLIER_ID, -MANUFACTURER_ID, -CURRENT_PINCODE_ID, -BRANCH_ID)
Now, let’s calculate the correlation matrix to see which features are correlated with each other, so we can reduce the number of variables used.
corr_cols <- c("PRI_ACTIVE_ACCTS", "PRI_CURRENT_BALANCE",
"PRI_SANCTIONED_AMOUNT", "PRI_DISBURSED_AMOUNT", "SEC_NO_OF_ACCTS",
"SEC_ACTIVE_ACCTS", "SEC_OVERDUE_ACCTS", "SEC_CURRENT_BALANCE",
"SEC_SANCTIONED_AMOUNT", "SEC_DISBURSED_AMOUNT", "PRI_NO_OF_ACCTS_new",
"PRI_OVERDUE_ACCTS_new")
corr_data <- train[, corr_cols]
corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")
corr_melted <- melt(corr_matrix)
# Create interactive heatmap
plot_ly(data = corr_melted,
x = ~Var1,
y = ~Var2,
z = ~value,
zmin = -1,
zmax = 1,
type = "heatmap",
colors = colorRamp(c("steelblue", "white", "darkgreen")))
Not highly correlated with any other variable: ‘PRI_ACTIVE_ACCTS’, ‘PRI_CURRENT_BALANCE’, ‘PRI_SANCTIONED_AMOUNT’, ‘PRI_DISBURSED_AMOUNT’, ‘SEC_OVERDUE_ACCTS’.
‘PRI_NO_OF_ACCTS_new’,‘PRI_OVERDUE_ACCTS_new’ are perfectly positively correlated and hence we are keeping only one.
‘SEC_NO_OF_ACCTS’, ‘SEC_ACTIVE_ACCTS’ are highly positively correlated, hence we are keeping only one.
‘SEC_CURRENT_BALANCE’, ‘SEC_SANCTIONED_AMOUNT’, ‘SEC_DISBURSED_AMOUNT’ are highly positively correlated, hence we are keeping only one.
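The same pruning decision can be sketched programmatically with caret::findCorrelation, which scans a correlation matrix and suggests columns to drop above a cutoff. Shown here only as an illustration on `corr_matrix`; the manual choices above are what we actually apply, and the 0.8 cutoff is an assumed value:

```r
# Suggest columns to drop when |correlation| exceeds 0.8 (illustrative cutoff)
high_corr <- findCorrelation(corr_matrix, cutoff = 0.8, names = TRUE)
print(high_corr)
```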
train <- train %>%
select(-PRI_OVERDUE_ACCTS_new, -SEC_ACTIVE_ACCTS, -SEC_SANCTIONED_AMOUNT, -SEC_DISBURSED_AMOUNT)
Now let’s analyze other variables.
corr_data <- train[, c('SEC_INSTAL_AMT',
'PERFORM_CNS_SCORE','NEW_ACCTS_IN_LAST_SIX_MONTHS',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'AVERAGE_ACCT_AGE',
'CREDIT_HISTORY_LENGTH', 'NO_OF_INQUIRIES','age', 'disbursal_time')]
corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")
corr_melted <- melt(corr_matrix)
# Create interactive heatmap
plot_ly(data = corr_melted,
x = ~Var1,
y = ~Var2,
z = ~value,
zmin = -1,
zmax = 1,
type = "heatmap",
colors = colorRamp(c("steelblue", "white", "darkgreen")))
‘AVERAGE_ACCT_AGE’, ‘CREDIT_HISTORY_LENGTH’ are highly positively correlated and hence we are keeping only one.
train <- train %>%
select(-AVERAGE_ACCT_AGE)
corr_data <- train[, c('PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE',
'PRI_SANCTIONED_AMOUNT', 'PERFORM_CNS_SCORE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT', 'NEW_ACCTS_IN_LAST_SIX_MONTHS',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'DISBURSED_AMOUNT_new', 'ASSET_COST_new', 'LTV_new',
'PRI_NO_OF_ACCTS_new', 'age', 'disbursal_time')]
corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")
corr_melted <- melt(corr_matrix)
# Create interactive heatmap
plot_ly(data = corr_melted,
x = ~Var1,
y = ~Var2,
z = ~value,
zmin = -1,
zmax = 1,
type = "heatmap",
colors = colorRamp(c("steelblue", "white", "darkgreen")))
Based on the correlation matrix we decided to: choose one of ‘PRI_SANCTIONED_AMOUNT’ and ‘PRI_DISBURSED_AMOUNT’; choose one of ‘LTV_new’ and ‘PRI_NO_OF_ACCTS_new’; and eliminate ‘NEW_ACCTS_IN_LAST_SIX_MONTHS’.
train <- train %>%
select(-PRI_SANCTIONED_AMOUNT,-PRI_NO_OF_ACCTS_new,-NEW_ACCTS_IN_LAST_SIX_MONTHS)
Now we will prepare our data sets. One will contain only continuous variables, with the outliers treated with imputation. The other will also contain the binned variables.
Continuous Variables
train_con <- train[, c('EMPLOYMENT_TYPE', 'MOBILENO_AVL_FLAG',
'AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG', 'DRIVING_FLAG',
'PASSPORT_FLAG', 'PERFORM_CNS_SCORE', 'PERFORM_CNS_SCORE_DESCRIPTION',
'PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'LOAN_DEFAULT', 'DISBURSED_AMOUNT_new',
'ASSET_COST_new', 'LTV_new', 'age', 'disbursal_time')]
Binned Variables
train_bin <- train[, c('UNIQUEID', 'EMPLOYMENT_TYPE',
'MOBILENO_AVL_FLAG', 'AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG',
'DRIVING_FLAG', 'PASSPORT_FLAG', 'PERFORM_CNS_SCORE',
'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_ACTIVE_ACCTS',
'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT', 'SEC_NO_OF_ACCTS',
'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE', 'PRIMARY_INSTAL_AMT',
'SEC_INSTAL_AMT', 'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS',
'CREDIT_HISTORY_LENGTH', 'NO_OF_INQUIRIES', 'LOAN_DEFAULT',
'DISBURSED_AMOUNT_bins', 'ASSET_COST_bins', 'LTV_bins',
'PERFORM_CNS_SCORE_bins', 'PRI_NO_OF_ACCTS_bins',
'PRI_OVERDUE_ACCTS_bins', 'age', 'disbursal_time')]
Let’s transform some of the variables for better modelling.
scaleColumns <- function(df, cols_to_scale) {
for (col in cols_to_scale) {
df[[col]] <- scale(df[[col]])
}
return(df)
}
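One caveat with scaleColumns(): scale() computes the mean and SD of whatever data it is given, so applying it separately to a held-out set would use different statistics. A sketch of the leakage-safe alternative with caret::preProcess (not applied in this notebook; the two columns are just examples):

```r
# preProcess stores the training means/SDs and reapplies them to new data
prep <- preProcess(train_con[, c("age", "disbursal_time")], method = c("center", "scale"))
head(predict(prep, train_con[, c("age", "disbursal_time")]))
```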
scaled_df <- scaleColumns(train_con, c('PERFORM_CNS_SCORE',
'PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'DISBURSED_AMOUNT_new', 'ASSET_COST_new', 'LTV_new',
'age', 'disbursal_time')
)
head(scaled_df)
## # A tibble: 6 × 26
## EMPLOYMENT_TYPE MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Salaried 1 1 0 0
## 2 Self employed 1 1 0 0
## 3 Self employed 1 1 0 0
## 4 Self employed 1 1 0 0
## 5 Self employed 1 1 0 0
## 6 Self employed 1 1 0 0
## # ℹ 21 more variables: DRIVING_FLAG <dbl>, PASSPORT_FLAG <dbl>,
## # PERFORM_CNS_SCORE <dbl[,1]>, PERFORM_CNS_SCORE_DESCRIPTION <chr>,
## # PRI_ACTIVE_ACCTS <dbl[,1]>, PRI_CURRENT_BALANCE <dbl[,1]>,
## # PRI_DISBURSED_AMOUNT <dbl[,1]>, SEC_NO_OF_ACCTS <dbl[,1]>,
## # SEC_OVERDUE_ACCTS <dbl[,1]>, SEC_CURRENT_BALANCE <dbl[,1]>,
## # PRIMARY_INSTAL_AMT <dbl[,1]>, SEC_INSTAL_AMT <dbl[,1]>,
## # DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS <dbl[,1]>, …
Let’s turn the categorical variables into the dummy variables.
train_dummy <- fastDummies::dummy_cols( scaled_df, remove_first_dummy = TRUE, remove_selected_columns = TRUE, ignore_na = TRUE)
head(train_dummy)
## # A tibble: 6 × 44
## MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG DRIVING_FLAG PASSPORT_FLAG
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## # ℹ 38 more variables: PERFORM_CNS_SCORE <dbl>, PRI_ACTIVE_ACCTS <dbl>,
## # PRI_CURRENT_BALANCE <dbl>, PRI_DISBURSED_AMOUNT <dbl>,
## # SEC_NO_OF_ACCTS <dbl>, SEC_OVERDUE_ACCTS <dbl>, SEC_CURRENT_BALANCE <dbl>,
## # PRIMARY_INSTAL_AMT <dbl>, SEC_INSTAL_AMT <dbl>,
## # DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS <dbl>, CREDIT_HISTORY_LENGTH <dbl>,
## # NO_OF_INQUIRIES <dbl>, DISBURSED_AMOUNT_new <dbl>, ASSET_COST_new <dbl>,
## # LTV_new <dbl>, age <dbl>, disbursal_time <dbl>, …
Finally, let’s divide our modified dataset into a train and test set.
y <- train_dummy["LOAN_DEFAULT_1"]
X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]
dim(y)
## [1] 233154 1
set.seed(101)
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.80, list = FALSE)
# Create training and testing datasets
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices, ]
y_test <- y[-train_indices, ]
k_fold <- trainControl(method = "cv", number = 10)
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
Now we can proceed to modeling. Firstly, we will define a couple of functions that will help us in the modeling process.
# Confusion Matrix
plot_confusion_matrix <- function(cm, classes, normalize = FALSE, title = "Confusion Matrix") {
cm_df <- as.data.frame(cm)
# as.data.frame() on a caret confusion-matrix table yields columns Prediction, Reference, Freq
colnames(cm_df) <- c("Predicted", "True", "Freq")
if (normalize) {
cm_df <- cm_df %>%
group_by(True) %>%
mutate(Freq = Freq / sum(Freq)) %>%
ungroup()
}
max_val <- max(cm_df$Freq)
cm_df$text_color <- ifelse(cm_df$Freq > max_val / 2, "white", "black")
decimals <- if(normalize) 2 else 0
ggplot(cm_df, aes(x = Predicted, y = True, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = round(Freq, decimals), color = text_color), size = 4) +
scale_fill_gradient(low = "white", high = "blue") +
scale_color_identity() +
labs(title = title, x = "Predicted label", y = "True label") +
theme_minimal()
}
# Precision, Recall, F1 Score
show_metrics <- function(cm) {
cm <- as.matrix(cm)
tn <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tp <- cm[2, 2]
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1_score <- 2 * ((precision * recall) / (precision + recall))
cat(sprintf("Precision = %.3f\n", precision))
cat(sprintf("Recall = %.3f\n", recall))
cat(sprintf("F1_score = %.3f\n", f1_score))
}
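A quick illustration of show_metrics() on a hypothetical 2x2 matrix (rows = truth, columns = prediction; the counts are made up):

```r
toy_cm <- matrix(c(50, 10,   # tn, fp
                   5, 35),   # fn, tp
                 nrow = 2, byrow = TRUE)
show_metrics(toy_cm)  # precision = 35/45, recall = 35/40
```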
# Precision-recall curve
plot_precision_recall <- function(recall, precision) {
df <- data.frame(recall = recall, precision = precision)
ggplot(df, aes(x = recall, y = precision)) +
geom_step(direction = "hv", alpha = 0.5, color = "blue") +
geom_ribbon(aes(ymin = 0, ymax = precision), alpha = 0.2, fill = "blue") +
xlim(0.0, 1.0) + ylim(0.0, 1.05) +
labs(title = "Precision-Recall Curve", x = "Recall", y = "Precision") +
theme_minimal()
}
# ROC curve
plot_roc <- function(fpr, tpr) {
df <- data.frame(fpr = fpr, tpr = tpr)
ggplot(df, aes(x = fpr, y = tpr)) +
geom_line(color = "blue", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color= "black", linewidth = 1) +
xlim(0.0, 1.0) + ylim(0.0, 1.05) +
labs(title = "ROC Curve", x = "False Positive Rate", y = "True Positive Rate") +
theme_minimal()
}
#feature importance plot
plot_feature_importance <- function(model, predictors) {
# randomForest stores importance as a matrix; use the MeanDecreaseGini column
tmp <- data.frame(Feature = predictors,
Feature_importance = model$importance[, "MeanDecreaseGini"])
tmp <- tmp[order(tmp$Feature_importance, decreasing = TRUE), ]
ggplot(tmp, aes(x = reorder(Feature, Feature_importance), y = Feature_importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Feature importance", x = "Feature", y = "Feature importance") +
theme_minimal() + theme(plot.title = element_text(size = 14))
}
print(class(y_train))
## [1] "tbl_df" "tbl" "data.frame"
if(is.list(y_train) && length(y_train) == 1) {
y_train <- unlist(y_train)
}
logmodel <- glm(y_train ~ ., data = X_train, family = "binomial")
log_probs <- predict(logmodel, newdata = X_test, type = "response")
# Convert probabilities to class predictions:
logpred <- ifelse(log_probs > 0.5, 1, 0)
# Print confusion matrix
if(is.data.frame(logpred)) {
logpred <- as.vector(unlist(logpred))
}
if(is.list(y_test) && length(y_test) == 1) {
y_test <- unlist(y_test)
}
# Cross-validation
# Define k-fold cross-validation control
ctrl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Perform cross-validation
cv_model <- train(
x = as.matrix(X_train),
y = factor(y_train),
method = "glm",
family = "binomial",
trControl = ctrl,
metric = "Accuracy"
)
# Print cross-validated accuracy
LOGCV <- cv_model$results$Accuracy
print(paste0("Cross-validated accuracy: ", round(LOGCV * 100, 2), "%"))
## [1] "Cross-validated accuracy: 78.3%"
cm_log <- confusionMatrix(factor(logpred, levels = c(0,1)), factor(y_test, levels = c(0,1)))
saveRDS(cm_log, "cm_log.rds")
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", cm_log$overall["Accuracy"], "\n")
## Accuracy of model 0.7815355
# F1 Score
cat("F1 Score", cm_log$byClass["F1"], "\n")
## F1 Score 0.8772162
# Recall Score
cat("Recall Score", cm_log$byClass["Sensitivity"], "\n")
## Recall Score 0.9986553
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_log$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.501928
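The balanced accuracy near 0.5 shows the model almost always predicts the majority class. One possible remedy, shown here only as a sketch outside the main pipeline (0.25 is an illustrative value), is lowering the 0.5 decision threshold to trade precision for minority-class recall:

```r
# Lower the classification threshold on the same predicted probabilities
logpred_low <- ifelse(log_probs > 0.25, 1, 0)
confusionMatrix(factor(logpred_low, levels = c(0, 1)),
                factor(y_test, levels = c(0, 1)))
```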
set.seed(42)
if(!is.factor(y_train)) {
y_train_factor <- factor(y_train)
} else {
y_train_factor <- y_train
}
# Train the Random Forest model
rfc <- randomForest(
x = X_train,
y = y_train_factor,
ntree = 10,
importance = TRUE
)
# Predict on test set
rfc_pred <- predict(rfc, X_test)
# Ensure factors have same levels for confusion matrix
y_test_factor <- factor(y_test)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
# Create confusion matrix and print it
cm_rf <- confusionMatrix(rfc_pred_factor, factor(y_test))
print(cm_rf$table)
## Reference
## Prediction 0 1
## 0 35792 9807
## 1 647 384
saveRDS(cm_rf, "cm_rf.rds")
# Print results
cat("Accuracy of model", cm_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.7758096
cat("F1 Score", cm_rf$byClass["F1"], "\n")
## F1 Score 0.8725712
cat("Recall Score", cm_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.9822443
cat("Balanced Accuracy Score", cm_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5099623
nb_model <- naiveBayes(x = X_train, y = factor(y_train))
# Predict on test set
nb_pred <- predict(nb_model, newdata = X_test, type = "class")
# Ensure predictions and actual values are factors with consistent levels for confusion matrix
y_test_factor <- factor(y_test, levels = levels(nb_pred))
# Create confusion matrix and print it
cm_nb <- confusionMatrix(nb_pred, y_test_factor)
saveRDS(cm_nb, "cm_nb.rds")
print(cm_nb$table)
## Reference
## Prediction 0 1
## 0 11838 2248
## 1 24601 7943
# Print accuracy as percentage
cat("Accuracy of model (Naive Bayes): ", round(cm_nb$overall["Accuracy"] * 100, 2), "%\n")
## Accuracy of model (Naive Bayes): 42.42 %
cat("F1 Score: ", cm_nb$byClass["F1"], "\n")
## F1 Score: 0.4685997
cat("Recall Score (Sensitivity): ", cm_nb$byClass["Sensitivity"], "\n")
## Recall Score (Sensitivity): 0.3248717
cat("Balanced Accuracy Score: ", cm_nb$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.5521425
set.seed(101)
# Preprocess y_train
if (is.list(y_train) || is.data.frame(y_train)) {
y_train <- unlist(y_train)
}
y_train <- factor(y_train)
y_test <- factor(y_test, levels = levels(y_train))
k_fold <- 10
ctrl <- trainControl(method = "cv", number = k_fold, classProbs = FALSE, summaryFunction = defaultSummary)
# Train SGD model using caret with glmnet method (elastic net logistic regression, similar to SGD)
sgd_model <- train(
x = X_train,
y = y_train,
method = "glmnet",
trControl = ctrl,
metric = "Accuracy"
)
# Predict on test data
sgd_pred <- predict(sgd_model, X_test)
# Confusion Matrix
cm_sgd <- confusionMatrix(sgd_pred, y_test, positive = levels(y_test)[2])
saveRDS(cm_sgd, "cm_sgd.rds")
print(cm_sgd$table)
## Reference
## Prediction 0 1
## 0 36438 10191
## 1 1 0
# Accuracy
accuracy <- cm_sgd$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 78.14 %
# F1 Score
f1 <- F1_Score(y_pred = sgd_pred, y_true = y_test, positive = levels(y_test)[2])
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: NaN
# Recall (Sensitivity)
recall <- cm_sgd$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_sgd$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.4999863
# Cross-validated accuracy
cv_accuracy <- max(sgd_model$results$Accuracy)
cat("Cross-validated accuracy:", round(cv_accuracy * 100, 2), "%\n")
## Cross-validated accuracy: 78.33 %
set.seed(101)
# Preprocess target variables
if (is.list(y_train) || is.data.frame(y_train)) {
y_train <- unlist(y_train)
}
y_train <- factor(y_train)
y_test <- factor(y_test, levels = levels(y_train))
k_fold <- 10
ctrl <- trainControl(method = "cv", number = k_fold)
# Train Decision Tree model with depth and leaf-size constraints
dtree_model <- train(
x = X_train,
y = y_train,
method = "rpart",
tuneGrid = data.frame(cp = 0),
trControl = ctrl,
control = rpart.control(maxdepth = 10, minbucket = 30, maxcompete = 0, maxsurrogate = 0)
)
# Predict on test set
dtree_pred <- predict(dtree_model, X_test)
# Confusion matrix
cm_dt <- confusionMatrix(dtree_pred, y_test, positive = levels(y_test)[2])
saveRDS(cm_dt, "cm_dt.rds")
print(cm_dt$table)
## Reference
## Prediction 0 1
## 0 36125 9997
## 1 314 194
# Accuracy
accuracy <- cm_dt$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 77.89 %
# F1 Score
f1 <- F1_Score(y_pred = dtree_pred, y_true = y_test, positive = levels(y_test)[2])
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: 0.036
# Recall (Sensitivity)
recall <- cm_dt$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0.019
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_dt$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.5052096
# Cross-validated accuracy (already done during train)
cv_accuracy <- max(dtree_model$results$Accuracy)
cat("Cross-validated accuracy:", round(cv_accuracy * 100, 2), "%\n")
## Cross-validated accuracy: 77.97 %
train_with_progress_bar <- function(X_train, y_train, nrounds = 100) {
# Convert data
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = as.numeric(y_train) - 1)
# Create progress bar
pb <- progress_bar$new(
format = " Training [:bar] :percent eta: :eta",
total = nrounds,
clear = FALSE,
width = 60
)
# Create a proper named callback function
progress_callback <- function(env = parent.frame()) {
pb$tick()
return(TRUE)
}
# Assign a name attribute to the callback
attr(progress_callback, "name") <- "progress_callback"
# Train model with callback
model <- xgb.train(
params = list(
objective = "binary:logistic",
eval_metric = "logloss"
),
data = dtrain,
nrounds = nrounds,
callbacks = list(progress_callback),
verbose = 0 # Disable default verbosity
)
return(model)
}
# Call the function
xgb_model <- train_with_progress_bar(X_train, y_train, nrounds = 100)
# Predict on test set
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = as.numeric(y_test) - 1)
xgb_pred <- predict(xgb_model, dtest)
predictions <- ifelse(xgb_pred > 0.5, 1, 0)
y_test_numeric <- as.numeric(y_test) -1
# Confusion matrix
cm_xgb <- confusionMatrix(factor(predictions, levels = c(0,1)), factor(y_test_numeric, levels = c(0,1)))
saveRDS(cm_xgb, "cm_xgb.rds")
print(cm_xgb$table)
## Reference
## Prediction 0 1
## 0 36212 9993
## 1 227 198
# Accuracy
accuracy <- cm_xgb$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 78.08 %
# F1 Score
f1 <- F1_Score(factor(predictions, levels = c(0,1)), factor(y_test_numeric, levels = c(0,1)))
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: 0.876
# Recall (Sensitivity)
recall <- cm_xgb$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0.994
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_xgb$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.5065997
SMOTE (Synthetic Minority Oversampling Technique) creates synthetic data: it uses a nearest neighbors algorithm to interpolate new minority-class samples that we can use for training our model.
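A toy illustration of smotefamily::SMOTE on a small made-up imbalanced set, showing how the dup_size argument controls how many synthetic minority rows are added:

```r
set.seed(1)
toy_X <- data.frame(a = rnorm(110), b = rnorm(110))
toy_y <- c(rep(0, 100), rep(1, 10))
# dup_size = 9 adds 9 synthetic points per minority sample (10 -> 100)
toy_smote <- SMOTE(toy_X, toy_y, K = 3, dup_size = 9)
table(toy_smote$data$class)
```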
set.seed(27)
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
# Combine features and target for SMOTE
train_data <- cbind(X_train, target = y_train)
set.seed(27)
smote_result <- SMOTE(X_train, y_train, K = 5, dup_size = 1)
X_train <- smote_result$data
y_train <- X_train$class
X_train$class <- NULL
Stochastic Gradient Descent with Modified Huber Loss
#' SGD with Modified Huber Loss Implementation
#'
#' @param X Matrix or data frame of features
#' @param y Vector of binary labels (0 or 1)
#' @param learning_rate Initial learning rate
#' @param epochs Number of epochs (passes through the data)
#' @param batch_size Size of mini-batches
#' @param alpha L2 regularization parameter
#' @param shuffle Whether to shuffle the data at each epoch
#' @param random_seed Random seed for reproducibility
#' @param verbose Whether to print progress
#'
#' @return List containing weights, bias, and training history
sgd_modified_huber <- function(X, y, learning_rate = 0.01, epochs = 100,
batch_size = 32, alpha = 0.0001,
shuffle = TRUE, random_seed = NULL,
verbose = TRUE) {
# Set random seed if provided
if (!is.null(random_seed)) {
set.seed(random_seed)
}
# Ensure X is a matrix
if (is.data.frame(X)) {
X <- as.matrix(X)
}
# Initialize parameters
n_samples <- nrow(X)
n_features <- ncol(X)
w <- rep(0, n_features)
b <- 0
# Initialize training history
history <- data.frame(epoch = integer(), loss = numeric())
# Modified Huber loss gradient function
modified_huber_gradient <- function(y_true, y_pred) {
# Calculate margin = y * (w*x + b)
margin <- y_true * y_pred
# Gradient of the loss w.r.t. the score. Note: the exact derivatives are
# -4*y for margin <= -1 and -2*y*(1 - margin) for -1 < margin < 1; the constant
# factors are dropped here, which effectively rescales the update step.
gradients <- rep(0, length(margin))
case1 <- margin <= -1
case2 <- margin > -1 & margin < 1
gradients[case1] <- -y_true[case1]
gradients[case2] <- -y_true[case2] * (1 - margin[case2])
return(gradients)
}
# Shuffle function
shuffle_data <- function(X, y) {
idx <- sample(n_samples)
return(list(X = X[idx,], y = y[idx]))
}
# Training loop
for (epoch in 1:epochs) {
epoch_loss <- 0
# Shuffle data if requested
if (shuffle) {
shuffled <- shuffle_data(X, y)
X_epoch <- shuffled$X
y_epoch <- shuffled$y
} else {
X_epoch <- X
y_epoch <- y
}
# Process mini-batches
for (i in seq(1, n_samples, by = batch_size)) {
# Get mini-batch
end_idx <- min(i + batch_size - 1, n_samples)
X_batch <- X_epoch[i:end_idx, , drop = FALSE]
y_batch <- y_epoch[i:end_idx]
# Convert y to {-1, 1} for modified Huber loss
y_binary <- 2 * y_batch - 1
# Forward pass
scores <- X_batch %*% w + b
# Compute gradients
grads <- modified_huber_gradient(y_binary, scores)
# Update parameters
grad_w <- t(X_batch) %*% grads / length(y_batch) + alpha * w
grad_b <- mean(grads)
w <- w - learning_rate * grad_w
b <- b - learning_rate * grad_b
# Compute loss for monitoring
loss_batch <- calculate_modified_huber_loss(y_binary, scores)
epoch_loss <- epoch_loss + loss_batch * length(y_batch)
}
# Average loss for the epoch
epoch_loss <- epoch_loss / n_samples
# Save history
history <- rbind(history, data.frame(epoch = epoch, loss = epoch_loss))
# Print progress
if (verbose && epoch %% 10 == 0) {
cat(sprintf("Epoch %d/%d - Loss: %.4f\n", epoch, epochs, epoch_loss))
}
}
# Return model parameters and history
return(list(
weights = w,
bias = b,
history = history
))
}
# Helper function to calculate modified Huber loss
calculate_modified_huber_loss <- function(y_true, scores) {
margin <- y_true * scores
# Calculate loss based on the margin
losses <- rep(0, length(margin))
case1 <- margin <= -1
case2 <- margin > -1 & margin < 1
losses[case1] <- -4 * margin[case1]
losses[case2] <- (1 - margin[case2])^2
return(mean(losses))
}
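A quick numeric check of the helper against the piecewise definition of the modified Huber loss (margin = y * score; zero for margin >= 1, squared hinge in between, linear below -1):

```r
# margins 2, 0, -3 give losses 0, (1 - 0)^2 = 1, and -4*(-3) = 12; mean = 13/3
calculate_modified_huber_loss(c(1, 1, 1), c(2, 0, -3))
```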
# Prediction function
predict_sgd <- function(X, model, threshold = 0.5) {
if (is.data.frame(X)) {
X <- as.matrix(X)
}
# Calculate raw scores
scores <- X %*% model$weights + model$bias
# Convert scores to probabilities (simplified sigmoid)
probs <- 1 / (1 + exp(-scores))
# Return class predictions
return(ifelse(probs > threshold, 1, 0))
}
y_train <- as.numeric(y_train)
# Fit the model
model <- sgd_modified_huber(
X = X_train,
y = y_train,
learning_rate = 0.01,
epochs = 100,
batch_size = 32,
alpha = 0.0001,
shuffle = TRUE,
random_seed = 101,
verbose = TRUE
)
## Epoch 10/100 - Loss: 0.8784
## Epoch 20/100 - Loss: 0.8784
## Epoch 30/100 - Loss: 0.8782
## Epoch 40/100 - Loss: 0.8783
## Epoch 50/100 - Loss: 0.8785
## Epoch 60/100 - Loss: 0.8783
## Epoch 70/100 - Loss: 0.8783
## Epoch 80/100 - Loss: 0.8782
## Epoch 90/100 - Loss: 0.8786
## Epoch 100/100 - Loss: 0.8787
# Make predictions
y_pred <- predict_sgd(X_test, model)
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
cm_sgd_smote <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(cm_sgd_smote$table)
## Reference
## Prediction 0 1
## 0 43371 11101
## 1 2431 1385
print(paste0(round(cm_sgd_smote$overall["Accuracy"], 2) * 100, "%"))
## [1] "77%"
# Accuracy
cat("Accuracy of model", cm_sgd_smote$overall["Accuracy"], "\n")
## Accuracy of model 0.7678424
# F1 Score
cat("F1 Score", cm_sgd_smote$byClass["F1"], "\n")
## F1 Score 0.8650498
# Recall Score
cat("Recall Score", cm_sgd_smote$byClass["Sensitivity"], "\n")
## Recall Score 0.9469237
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_sgd_smote$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.528924
Decision Tree Classifier
set.seed(101)
# Train the decision tree
dtree <- rpart(
formula = y_train ~ .,
data = cbind(X_train, y_train = y_train),
method = "class",
control = rpart.control(
maxdepth = 10,
minsplit = 60,
minbucket = 30,
cp = 0.001
)
)
# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")
# Ensure predictions and actual values are factors with the same levels
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
cm_smote_dt <- confusionMatrix(dtree_pred_factor, y_test_factor)
saveRDS(cm_smote_dt, "cm_smote_dt.rds")
print(cm_smote_dt$table)
## Reference
## Prediction 0 1
## 0 43661 11408
## 1 2141 1078
# Calculate accuracy
accuracy <- cm_smote_dt$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "76.76%"
# Accuracy
cat("Accuracy of model", cm_smote_dt$overall["Accuracy"], "\n")
## Accuracy of model 0.7675508
# F1 Score
cat("F1 Score", cm_smote_dt$byClass["F1"], "\n")
## F1 Score 0.8656799
# Recall Score
cat("Recall Score", cm_smote_dt$byClass["Sensitivity"], "\n")
## Recall Score 0.9532553
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_smote_dt$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.519796
Random Forest Classifier
set.seed(42)
if(!is.factor(y_train)) {
y_train_factor <- factor(y_train)
} else {
y_train_factor <- y_train
}
# Train the Random Forest model
rfc <- randomForest(
x = X_train,
y = y_train_factor,
ntree = 10,
importance = TRUE
)
# Predict on test set
rfc_pred <- predict(rfc, X_test)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
# Calculate confusion matrix
conf_matrix <- confusionMatrix(rfc_pred_factor, y_test_factor)
print(conf_matrix$table)
## Reference
## Prediction 0 1
## 0 42497 10887
## 1 3305 1599
# Calculate accuracy and print as percentage
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 75.65%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7565194
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8569153
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.9278416
# Balanced Accuracy Score
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5279525
With the SMOTE procedure applied, the best results, considering Accuracy, F1 Score and Recall, were obtained by the Decision Tree Classifier model.
Upsampling can be defined as adding more copies of the minority class, sampled with replacement. It can be a good choice when you don’t have much data to work with (though it is not a good choice here).
y <- train_dummy["LOAN_DEFAULT_1"]
X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]
set.seed(27)
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X <- cbind(X_train, LOAN_DEFAULT = y_train)
X <- as.data.frame(X)
not_default <- X %>% filter(LOAN_DEFAULT_1 == 0)
default <- X %>% filter(LOAN_DEFAULT_1 == 1)
# Print class distribution
cat("Distribution before resampling:\n")
## Distribution before resampling:
cat("Non-default samples:", nrow(not_default), "\n")
## Non-default samples: 136741
cat("Default samples:", nrow(default), "\n")
## Default samples: 38125
set.seed(27)
# Upsample the minority class (defaults) with replacement
default_upsampled <- default %>%
slice_sample(n = nrow(not_default), replace = TRUE)
# Combine majority class and upsampled minority class
upsampled <- bind_rows(not_default, default_upsampled)
# Check new class counts
upsampled_counts <- upsampled %>%
count(LOAN_DEFAULT_1)
print(upsampled_counts)
## LOAN_DEFAULT_1 n
## 1 0 136741
## 2 1 136741
# Separate features and target
y_train <- upsampled$LOAN_DEFAULT_1
X_train <- upsampled %>% select(-LOAN_DEFAULT_1)
Decision Tree Classifier
set.seed(101)
# Train the decision tree
dtree <- rpart(
formula = y_train ~ .,
data = cbind(X_train, y_train = y_train),
method = "class",
control = rpart.control(
maxdepth = 10,
minsplit = 60,
minbucket = 30,
cp = 0.001
)
)
# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")
# Ensure predictions and actual values are factors with the same levels
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
conf_matrix <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(conf_matrix$table)
## Reference
## Prediction 0 1
## 0 25113 4663
## 1 20689 7823
# Calculate accuracy
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "56.51%"
# Extract and print the metrics (equivalent to the Python code)
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.5650563
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.6645585
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.5482948
# Balanced Accuracy Score
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5874183
Stochastic Gradient Descent with Modified Huber Loss
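For reference, the modified Huber loss (as used, e.g., by scikit-learn's SGDClassifier) is a smoothed hinge loss on the margin m = y·f(x) with y ∈ {−1, +1}: it equals (max(0, 1 − m))² when m ≥ −1 and −4m otherwise. A minimal sketch of the loss itself, independent of the `sgd_modified_huber()` helper defined earlier in the notebook:

```r
# Modified Huber loss as a function of the margin m = y * f(x), y in {-1, +1}
modified_huber_loss <- function(m) {
  ifelse(m >= -1, pmax(0, 1 - m)^2, -4 * m)
}

# Confident correct predictions incur zero loss; large mistakes grow linearly
modified_huber_loss(c(2, 0.5, -0.5, -2))  # 0.00 0.25 2.25 8.00
```

The quadratic region near the decision boundary keeps the gradient smooth, while the linear tail makes the loss robust to outliers.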
# Fit the model
model <- sgd_modified_huber(
X = X_train,
y = y_train,
learning_rate = 0.01,
epochs = 100,
batch_size = 32,
alpha = 0.0001,
shuffle = TRUE,
random_seed = 101,
verbose = TRUE
)
## Epoch 10/100 - Loss: 0.9528
## Epoch 20/100 - Loss: 0.9527
## Epoch 30/100 - Loss: 0.9529
## Epoch 40/100 - Loss: 0.9528
## Epoch 50/100 - Loss: 0.9528
## Epoch 60/100 - Loss: 0.9526
## Epoch 70/100 - Loss: 0.9529
## Epoch 80/100 - Loss: 0.9527
## Epoch 90/100 - Loss: 0.9526
## Epoch 100/100 - Loss: 0.9527
# Make predictions
y_pred <- predict_sgd(X_test, model)
conf_matrix <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(conf_matrix$table)
## Reference
## Prediction 0 1
## 0 24190 4317
## 1 21612 8169
print(paste0(round(conf_matrix$overall["Accuracy"] * 100, 2), "%"))
## [1] "55.52%"
# Extract and print the metrics (equivalent to the Python code)
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.5551572
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.6510651
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.5281429
# Balanced Accuracy Score
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5911978
Random Forest Classifier
set.seed(42)
if(!is.factor(y_train)) {
y_train_factor <- factor(y_train)
} else {
y_train_factor <- y_train
}
# Train the Random Forest model
rfc <- randomForest(
x = X_train,
y = y_train_factor,
ntree = 10,
importance = TRUE
)
# Predict on test set
rfc_pred <- predict(rfc, X_test)
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
cm_upsampling_rf <- confusionMatrix(rfc_pred_factor, y_test_factor)
saveRDS(cm_upsampling_rf, "cm_upsampling_rf.rds")
print(cm_upsampling_rf$table)
## Reference
## Prediction 0 1
## 0 31983 6946
## 1 13819 5540
# Calculate accuracy and print as percentage
accuracy <- cm_upsampling_rf$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 64.38%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", cm_upsampling_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.6437517
# F1 Score
cat("F1 Score", cm_upsampling_rf$byClass["F1"], "\n")
## F1 Score 0.7549303
# Recall Score
cat("Recall Score", cm_upsampling_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.6982883
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_upsampling_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5709926
With the upsampling procedure applied, the Random Forest Classifier obtained the best results in terms of Accuracy, F1 Score, and Recall.
Undersampling removes some observations from the majority class. It can be a good choice when you have a very large dataset (think millions of rows), but its drawback is that we discard information that may be valuable, which can lead to underfitting and poor generalization to the test set.
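The idea can be illustrated on toy data (a minimal sketch using dplyr; the data and names are made up for illustration): the majority class is sampled down, without replacement, to the size of the minority class.

```r
library(dplyr)

set.seed(1)
# Toy imbalanced data: 90 majority rows (y = 0) vs 10 minority rows (y = 1)
toy <- data.frame(y = rep(c(0, 1), times = c(90, 10)), x = rnorm(100))

majority <- toy %>% filter(y == 0)
minority <- toy %>% filter(y == 1)

# Keep only as many randomly chosen majority rows as there are minority rows
majority_down <- majority %>% slice_sample(n = nrow(minority), replace = FALSE)
balanced <- bind_rows(majority_down, minority)

table(balanced$y)  # both classes now have 10 rows
```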
y <- train_dummy["LOAN_DEFAULT_1"]
X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]
set.seed(27)
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X <- cbind(X_train, LOAN_DEFAULT = y_train)
X <- as.data.frame(X)
not_fraud <- X %>% filter(LOAN_DEFAULT_1 == 0)
fraud <- X %>% filter(LOAN_DEFAULT_1 == 1)
set.seed(27)
# Downsample the majority class (non-default) without replacement
not_fraud_downsampled <- not_fraud %>%
slice_sample(n = nrow(fraud), replace = FALSE)
# Combine downsampled majority class and minority class
downsampled <- bind_rows(not_fraud_downsampled, fraud)
# Check new class counts
downsampled_counts <- downsampled %>%
count(LOAN_DEFAULT_1)
print(downsampled_counts)
## LOAN_DEFAULT_1 n
## 1 0 38125
## 2 1 38125
y_train <- downsampled$LOAN_DEFAULT_1
X_train <- downsampled %>% select(-LOAN_DEFAULT_1)
Decision Tree Classifier
set.seed(101)
# Train the decision tree
dtree <- rpart(
formula = y_train ~ .,
data = cbind(X_train, y_train = y_train),
method = "class",
control = rpart.control(
maxdepth = 10,
minsplit = 60,
minbucket = 30,
cp = 0.001
)
)
# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
conf_matrix <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(conf_matrix$table)
## Reference
## Prediction 0 1
## 0 45802 12486
## 1 0 0
# Calculate accuracy
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "78.58%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7857878
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8800461
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 1
# Balanced Accuracy Score
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5
Stochastic Gradient Descent with Modified Huber Loss
model <- sgd_modified_huber(
X = X_train,
y = y_train,
learning_rate = 0.01,
epochs = 100,
batch_size = 32,
alpha = 0.0001,
shuffle = TRUE,
random_seed = 101,
verbose = TRUE
)
## Epoch 10/100 - Loss: 0.6618
## Epoch 20/100 - Loss: 0.6617
## Epoch 30/100 - Loss: 0.6620
## Epoch 40/100 - Loss: 0.6615
## Epoch 50/100 - Loss: 0.6621
## Epoch 60/100 - Loss: 0.6616
## Epoch 70/100 - Loss: 0.6619
## Epoch 80/100 - Loss: 0.6619
## Epoch 90/100 - Loss: 0.6616
## Epoch 100/100 - Loss: 0.6619
y_pred <- predict_sgd(X_test, model)
conf_matrix <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(conf_matrix$table)
## Reference
## Prediction 0 1
## 0 45710 12415
## 1 92 71
print(paste0(round(conf_matrix$overall["Accuracy"] * 100, 2), "%"))
## [1] "78.54%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7854275
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8796559
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.9979914
# Balanced Accuracy Score
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5018389
Random Forest Classifier
set.seed(42)
if(!is.factor(y_train)) {
y_train_factor <- factor(y_train)
} else {
y_train_factor <- y_train
}
# Train the Random Forest model
rfc <- randomForest(
x = X_train,
y = y_train_factor,
ntree = 10,
importance = TRUE
)
# Predict on test set
rfc_pred <- predict(rfc, X_test)
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
cm_downsampling_rf <- confusionMatrix(rfc_pred_factor, y_test_factor)
saveRDS(cm_downsampling_rf, "cm_downsampling_rf.rds")
print(cm_downsampling_rf$table)
## Reference
## Prediction 0 1
## 0 44965 12068
## 1 837 418
accuracy <- cm_downsampling_rf$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 77.86%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", cm_downsampling_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.7785994
# F1 Score
cat("F1 Score", cm_downsampling_rf$byClass["F1"], "\n")
## F1 Score 0.8745077
# Recall Score
cat("Recall Score", cm_downsampling_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.9817257
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_downsampling_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5076016
With the downsampling procedure applied, the Random Forest Classifier obtained the best results, based on the lowest number of false negatives.
Principal Component Analysis (PCA) is a dimensionality reduction technique used in statistics and machine learning to reduce the number of features (dimensions) in a dataset while preserving as much of its variance (information) as possible.
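As a small self-contained illustration (using the built-in `USArrests` data rather than the loan data), the per-component proportion of variance falls out of `prcomp` directly:

```r
# PCA on a built-in dataset; scale. = TRUE because the variables
# are measured in very different units
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component,
# and its cumulative sum (used below for the scree plots)
ve <- pca$sdev^2 / sum(pca$sdev^2)
round(ve, 3)
round(cumsum(ve), 3)
```

The first component alone captures roughly 62% of the variance in this example, which is exactly the kind of pattern the scree plot below is used to detect.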
y <- train_dummy["LOAN_DEFAULT_1"]
X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]
X$`EMPLOYMENT_TYPE_Self employed`[is.na(X$`EMPLOYMENT_TYPE_Self employed`)] <- 0
set.seed(27)
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
pca_result <- prcomp(X_train,
center = TRUE,
scale. = FALSE)
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
plot(variance_explained,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b",
main = "Scree Plot")
# Cumulative variance explained
cumulative_variance <- cumsum(variance_explained)
plot(cumulative_variance,
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b",
main = "Cumulative Variance Explained")
The cumulative variance plot shows that 17 components are enough to explain most of the variance, so we re-run the PCA keeping only 17 components.
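The number of components can also be chosen programmatically: pick the smallest k whose cumulative proportion of variance clears a chosen threshold. `choose_k` below is an illustrative helper, not part of the original notebook:

```r
# Smallest number of components whose cumulative proportion of
# explained variance reaches the threshold
choose_k <- function(variance_explained, threshold = 0.95) {
  which(cumsum(variance_explained) >= threshold)[1]
}

# Example with made-up proportions: cumulative sums are
# 0.50, 0.75, 0.90, 0.96, 1.00, so the first to reach 0.95 is component 4
choose_k(c(0.50, 0.25, 0.15, 0.06, 0.04))  # 4
```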
pca_result <- prcomp(X_train,
center = TRUE,
scale. = FALSE,
rank. = 17)
X_train_PCA <- as.data.frame(pca_result$x)
Decision Tree Classifier
set.seed(101)
y_train <- y_train$LOAN_DEFAULT_1
# Train the decision tree
dtree <- rpart(
formula = y_train ~ .,
data = cbind(X_train_PCA, y_train = y_train),
method = "class",
control = rpart.control(
maxdepth = 10,
minsplit = 60,
minbucket = 30,
cp = 0.001
)
)
# Predict on test set
X_test_PCA <- predict(pca_result, newdata = X_test)
X_test_PCA <- as.data.frame(X_test_PCA)
dtree_pred <- predict(dtree, X_test_PCA, type = "class")
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))
# Calculate confusion matrix
cm_pca <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(cm_pca$table)
## Reference
## Prediction 0 1
## 0 45802 12486
## 1 0 0
# Calculate accuracy
accuracy <- cm_pca$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "78.58%"
# Accuracy
cat("Accuracy of model", cm_pca$overall["Accuracy"], "\n")
## Accuracy of model 0.7857878
# F1 Score
cat("F1 Score", cm_pca$byClass["F1"], "\n")
## F1 Score 0.8800461
# Recall Score
cat("Recall Score", cm_pca$byClass["Sensitivity"], "\n")
## Recall Score 1
# Balanced Accuracy Score
cat("Balanced Accuracy Score", cm_pca$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5
extract_caret_metrics <- function(conf_matrix_obj) {
# Extract the standard metrics provided by caret
accuracy <- conf_matrix_obj$overall["Accuracy"]
# Extract class-specific statistics
class_stats <- conf_matrix_obj$byClass
# For binary classification, we're interested in the positive class metrics
precision <- class_stats["Pos Pred Value"] # Precision = Positive Predictive Value
recall <- class_stats["Sensitivity"] # Recall = Sensitivity
specificity <- class_stats["Specificity"]
f1_score <- class_stats["F1"]
return(data.frame(
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score,
Specificity = specificity
))
}
# Function to read RDS files containing caret confusionMatrix objects
read_caret_model_results <- function(file_path, model_name) {
# Read the RDS file
result <- readRDS(file_path)
# Check if it's a caret confusionMatrix object
if (inherits(result, "confusionMatrix")) {
conf_matrix_obj <- result
} else {
stop(paste("File does not contain a caret confusionMatrix object:", file_path))
}
# Extract the actual confusion matrix table
conf_matrix <- conf_matrix_obj$table
# Extract metrics
metrics <- extract_caret_metrics(conf_matrix_obj)
return(list(
model_name = model_name,
confusion_matrix = conf_matrix,
conf_matrix_obj = conf_matrix_obj, # Keep the full object for reference
metrics = metrics
))
}
# Function to plot confusion matrix
plot_confusion_matrix <- function(conf_matrix, title) {
# Convert matrix to data frame for plotting
conf_df <- as.data.frame(as.table(conf_matrix))
names(conf_df) <- c("Predicted", "Actual", "Frequency")  # table rows are Prediction, columns are Reference
# Calculate percentages for text labels
total <- sum(conf_df$Frequency)
conf_df$Percentage <- conf_df$Frequency / total * 100
# Calculate row totals for row-wise percentages
row_totals <- aggregate(Frequency ~ Actual, conf_df, sum)
conf_df <- merge(conf_df, row_totals, by = "Actual", suffixes = c("", "_total"))
conf_df$Row_Percentage <- conf_df$Frequency / conf_df$Frequency_total * 100
# Create plot
ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Frequency)) +
geom_tile() +
geom_text(aes(label = sprintf("%.0f\n(%.1f%%)", Frequency, Row_Percentage)),
color = "white", size = 4) +
scale_fill_gradient(low = "steelblue", high = "darkblue") +
labs(
title = title,
x = "Predicted Class",
y = "Actual Class"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "none"
)
}
# Function to plot comparison of metrics across models
plot_metrics_comparison <- function(all_metrics) {
# Convert to long format for plotting
metrics_long <- all_metrics %>%
pivot_longer(cols = c("Accuracy", "Precision", "Recall", "F1_Score", "Specificity"),
names_to = "Metric", values_to = "Value")
# Create the plot
ggplot(metrics_long, aes(x = Model, y = Value, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.3f", Value)), position = position_stack(vjust = 0.5),
color = "white") +
facet_wrap(~ Metric, scales = "free_y") +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(
title = "Model Comparison by Performance Metrics",
x = "Model",
y = "Score"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
legend.position = "none",
strip.text = element_text(face = "bold", size = 12)
)
}
# Main function to compare models
compare_caret_models <- function(file_paths, model_names) {
# Check that the lists have the same length
if (length(file_paths) != length(model_names)) {
stop("The number of file paths and model names must be the same")
}
# Read all model results
model_results <- list()
for (i in 1:length(file_paths)) {
tryCatch({
model_results[[i]] <- read_caret_model_results(file_paths[i], model_names[i])
cat("Successfully loaded: ", model_names[i], "\n")
}, error = function(e) {
cat("Error loading ", model_names[i], ": ", e$message, "\n")
})
}
# Extract metrics for all models
all_metrics <- data.frame()
for (result in model_results) {
metrics_df <- result$metrics
metrics_df$Model <- result$model_name
all_metrics <- rbind(all_metrics, metrics_df)
}
# Create plots for confusion matrices
conf_plots <- list()
for (i in 1:length(model_results)) {
conf_plots[[i]] <- plot_confusion_matrix(
model_results[[i]]$confusion_matrix,
paste("Confusion Matrix -", model_results[[i]]$model_name)
)
}
# Plot metrics comparison
metrics_plot <- plot_metrics_comparison(all_metrics)
# Print metrics table sorted by F1 Score
cat("\n================= MODEL COMPARISON SUMMARY =================\n")
print(all_metrics %>%
arrange(desc(F1_Score)) %>%
select(Model, Accuracy, Precision, Recall, F1_Score, Specificity) %>%
mutate_if(is.numeric, round, 4))
# Create a summary table for statistical comparison
cat("\n================= DETAILED MODEL STATISTICS =================\n")
for (i in 1:length(model_results)) {
cat("\n", model_names[i], ":\n")
print(model_results[[i]]$conf_matrix_obj$overall)
cat("\nClass-specific statistics:\n")
print(model_results[[i]]$conf_matrix_obj$byClass)
cat("\n-------------------------------------------\n")
}
# Determine best model based on F1 Score
best_model <- all_metrics %>%
arrange(desc(F1_Score)) %>%
dplyr::slice(1)
cat("\n====================== BEST MODEL ======================\n")
cat("Based on F1 Score: ", best_model$Model, "\n")
cat("F1 Score: ", round(best_model$F1_Score, 4), "\n")
cat("Accuracy: ", round(best_model$Accuracy, 4), "\n")
cat("Recall (Sensitivity): ", round(best_model$Recall, 4), "\n")
cat("Precision (Positive Predictive Value): ", round(best_model$Precision, 4), "\n")
# Return results
return(list(
metrics = all_metrics,
confusion_plots = conf_plots,
metrics_plot = metrics_plot,
best_model = best_model$Model,
model_results = model_results # Return full model results for additional analysis
))
}
# Function to process all RDS files in a directory
batch_process_caret_models <- function(directory_path) {
# Get all RDS files in the directory
rds_files <- list.files(directory_path, pattern = "\\.rds$", full.names = TRUE)
if (length(rds_files) == 0) {
cat("No RDS files found in directory:", directory_path, "\n")
return(NULL)
}
# Extract model names from filenames (remove path and extension)
model_names <- gsub("\\.rds$", "", basename(rds_files))
# Compare models
results <- compare_caret_models(rds_files, model_names)
# Save results automatically
output_dir <- file.path(directory_path, "model_comparison_results")
if (!dir.exists(output_dir)) {
dir.create(output_dir)
}
# Save metrics comparison plot
metrics_filename <- file.path(output_dir, "metrics_comparison.png")
ggsave(metrics_filename, results$metrics_plot, width = 10, height = 8)
# Save confusion matrix plots
for (i in 1:length(results$confusion_plots)) {
conf_filename <- file.path(output_dir, paste0("confusion_matrix_", model_names[i], ".png"))
ggsave(conf_filename, results$confusion_plots[[i]], width = 7, height = 6)
}
# Create a combined confusion matrix plot
#combined_plot <- gridExtra::grid.arrange(grobs = results$confusion_plots, ncol = 2)
#ggsave(file.path(output_dir, "combined_confusion_matrices.png"), combined_plot, width = 14, height = 10)
# Save metrics table
write.csv(results$metrics, file.path(output_dir, "metrics_summary.csv"), row.names = FALSE)
cat("\nResults saved to directory:", output_dir, "\n")
return(results)
}
path <- getwd()
results <- batch_process_caret_models(path)
## Successfully loaded: cm_downsampling_rf
## Successfully loaded: cm_dt
## Successfully loaded: cm_log
## Successfully loaded: cm_nb
## Successfully loaded: cm_rf
## Successfully loaded: cm_sgd
## Successfully loaded: cm_smote_dt
## Successfully loaded: cm_upsampling_rf
## Successfully loaded: cm_xgb
##
## ================= MODEL COMPARISON SUMMARY =================
## Model Accuracy Precision Recall F1_Score Specificity
## Accuracy2 cm_log 0.7815 0.7821 0.9987 0.8772 0.0052
## Accuracy8 cm_xgb 0.7808 0.7837 0.9938 0.8763 0.0194
## Accuracy cm_downsampling_rf 0.7786 0.7884 0.9817 0.8745 0.0335
## Accuracy4 cm_rf 0.7758 0.7849 0.9822 0.8726 0.0377
## Accuracy6 cm_smote_dt 0.7676 0.7928 0.9533 0.8657 0.0863
## Accuracy7 cm_upsampling_rf 0.6438 0.8216 0.6983 0.7549 0.4437
## Accuracy3 cm_nb 0.4242 0.8404 0.3249 0.4686 0.7794
## Accuracy1 cm_dt 0.7789 0.3819 0.0190 0.0363 0.9914
## Accuracy5 cm_sgd 0.7814 0.0000 0.0000 NaN 1.0000
##
## ================= DETAILED MODEL STATISTICS =================
##
## cm_downsampling_rf :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.77859937 0.02259487 0.77520638 0.78196460 0.78578781
## AccuracyPValue McnemarPValue
## 0.99998774 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.98172569 0.03347749 0.78840321
## Neg Pred Value Precision Recall
## 0.33306773 0.78840321 0.98172569
## F1 Prevalence Detection Rate
## 0.87450771 0.78578781 0.77142808
## Detection Prevalence Balanced Accuracy
## 0.97846898 0.50760159
##
## -------------------------------------------
##
## cm_dt :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.77887626 0.01583981 0.77508148 0.78263629 0.78144971
## AccuracyPValue McnemarPValue
## 0.91140137 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.019036405 0.991382859 0.381889764
## Neg Pred Value Precision Recall
## 0.783248775 0.381889764 0.019036405
## F1 Prevalence Detection Rate
## 0.036265072 0.218550290 0.004160412
## Detection Prevalence Balanced Accuracy
## 0.010894274 0.505209632
##
## -------------------------------------------
##
## cm_log :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.781535492 0.005992706 0.777756880 0.785279023 0.781449710
## AccuracyPValue McnemarPValue
## 0.484775934 0.000000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.998655287 0.005200667 0.782109697
## Neg Pred Value Precision Recall
## 0.519607843 0.782109697 0.998655287
## F1 Prevalence Detection Rate
## 0.877216243 0.781449710 0.780398885
## Detection Prevalence Balanced Accuracy
## 0.997812567 0.501927977
##
## -------------------------------------------
##
## cm_nb :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.42421188 0.05826022 0.41972021 0.42871299 0.78144971
## AccuracyPValue McnemarPValue
## 1.00000000 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.3248717 0.7794132 0.8404089
## Neg Pred Value Precision Recall
## 0.2440696 0.8404089 0.3248717
## F1 Prevalence Detection Rate
## 0.4685997 0.7814497 0.2538709
## Detection Prevalence Balanced Accuracy
## 0.3020802 0.5521425
##
## -------------------------------------------
##
## cm_rf :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.77580956 0.02946235 0.77199642 0.77958834 0.78144971
## AccuracyPValue McnemarPValue
## 0.99838371 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.98224430 0.03768031 0.78492949
## Neg Pred Value Precision Recall
## 0.37245393 0.78492949 0.98224430
## F1 Prevalence Detection Rate
## 0.87257125 0.78144971 0.76757452
## Detection Prevalence Balanced Accuracy
## 0.97788977 0.50996230
##
## -------------------------------------------
##
## cm_sgd :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.814283e-01 -4.288847e-05 7.776490e-01 7.851725e-01 7.814497e-01
## AccuracyPValue McnemarPValue
## 5.071246e-01 0.000000e+00
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.000000e+00 9.999726e-01 0.000000e+00
## Neg Pred Value Precision Recall
## 7.814450e-01 0.000000e+00 0.000000e+00
## F1 Prevalence Detection Rate
## NaN 2.185503e-01 0.000000e+00
## Detection Prevalence Balanced Accuracy
## 2.144542e-05 4.999863e-01
##
## -------------------------------------------
##
## cm_smote_dt :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.76755078 0.05423058 0.76409986 0.77097504 0.78578781
## AccuracyPValue McnemarPValue
## 1.00000000 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.9532553 0.0863367 0.7928417
## Neg Pred Value Precision Recall
## 0.3348866 0.7928417 0.9532553
## F1 Prevalence Detection Rate
## 0.8656799 0.7857878 0.7490564
## Detection Prevalence Balanced Accuracy
## 0.9447742 0.5197960
##
## -------------------------------------------
##
## cm_upsampling_rf :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.6437517 0.1183014 0.6398483 0.6476408 0.7857878
## AccuracyPValue McnemarPValue
## 1.0000000 0.0000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.6982883 0.4436969 0.8215726
## Neg Pred Value Precision Recall
## 0.2861718 0.8215726 0.6982883
## F1 Prevalence Detection Rate
## 0.7549303 0.7857878 0.5487064
## Detection Prevalence Balanced Accuracy
## 0.6678733 0.5709926
##
## -------------------------------------------
##
## cm_xgb :
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.78082779 0.02015605 0.77704486 0.78457574 0.78144971
## AccuracyPValue McnemarPValue
## 0.62986861 0.00000000
##
## Class-specific statistics:
## Sensitivity Specificity Pos Pred Value
## 0.99377041 0.01942891 0.78372471
## Neg Pred Value Precision Recall
## 0.46588235 0.78372471 0.99377041
## F1 Prevalence Detection Rate
## 0.87633706 0.78144971 0.77658160
## Detection Prevalence Balanced Accuracy
## 0.99088570 0.50659966
##
## -------------------------------------------
##
## ====================== BEST MODEL ======================
## Based on F1 Score: cm_log
## F1 Score: 0.8772
## Accuracy: 0.7815
## Recall (Sensitivity): 0.9987
## Precision (Positive Predictive Value): 0.7821
##
## Results saved to directory: /Users/joannamisiak/Desktop/STUDIA/Reproducible_Research/RR_project/model_comparison_results
Based on our code, the best results were obtained by the Logistic Regression model, whereas the original study's best model was a Random Forest with SMOTE applied. The results differ substantially, which can be explained by differences in the specification of the modelling functions used in each implementation.